In [1]:
import sys
sys.path.insert(0, "..")  # Makes revscoring importable when running this from revscoring/ipython
from revscoring.features import revision, diff, Feature, modifiers
from revscoring.datasources.revision import text as revision_text
from revscoring.extractors import APIExtractor
from mw import api

Feature extractor setup

This line constructs a "feature extractor" that uses Wikipedia's API to solve dependencies.


In [2]:
extractor = APIExtractor(api.Session("https://en.wikipedia.org/w/api.php"))


WARNING:mw.api.session:Sending requests with default User-Agent.  Set 'user_agent' on api.Session to quiet this message.

Using the extractor to extract features

The following line demonstrates a simple feature extraction. Note that we wrap the call in a list() because it returns a generator.


In [3]:
list(extractor.extract(123456789, [diff.chars_added]))


Out[3]:
[6]
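
To give a feel for what "solving dependencies" and "returns a generator" mean here, the following is a toy sketch in plain Python. It does not use revscoring: ToyDependent, solve and extract are hypothetical names invented for this illustration, the pre-filled dictionary stands in for API responses, and the length difference is a crude stand-in for a real diff.

```python
# Toy illustration only: ToyDependent, solve and extract are hypothetical
# names and are not part of revscoring.

class ToyDependent:
    """A named value computed from the values of other dependents."""

    def __init__(self, name, process, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)


def solve(dependent, cache):
    """Resolve a dependent recursively, computing each value at most once."""
    if dependent.name not in cache:
        args = [solve(d, cache) for d in dependent.depends_on]
        cache[dependent.name] = dependent.process(*args)
    return cache[dependent.name]


def extract(root_values, dependents):
    """Yield one value per requested dependent, lazily."""
    cache = dict(root_values)  # pre-filled values stand in for API responses
    for dependent in dependents:
        yield solve(dependent, cache)


# Stand-ins for data that a real extractor would fetch from the API.
parent_text = ToyDependent("parent_revision.text", lambda: None)
text = ToyDependent("revision.text", lambda: None)

# Crude stand-in for a diff: growth in length, never negative.
chars_added = ToyDependent(
    "diff.chars_added",
    lambda old, new: max(len(new) - len(old), 0),
    depends_on=[parent_text, text],
)

values = extract(
    {"parent_revision.text": "abc", "revision.text": "abcdef"},
    [chars_added],
)
result = list(values)  # nothing is computed until the generator is consumed
print(result)  # [3]
```

Because extract is a generator function, calling it does no work by itself; wrapping the call in list() is what forces the values to be computed.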

Defining a custom feature

The next block defines a new feature and sets the dependencies to be two other features: diff.chars_added and revision.chars. This feature represents the proportion of characters in the current version of the page that the current edit is responsible for adding.


In [4]:
chars_added_ratio = Feature("diff.chars_added_ratio", 
                            lambda a,c: a/max(c, 1), # Prevents divide by zero
                            depends_on=[diff.chars_added, revision.chars],
                            returns=float)
list(extractor.extract(123456789, [chars_added_ratio]))


Out[4]:
[0.0002550369803621525]
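
As a plain-Python check of the arithmetic (no revscoring involved), the function below mirrors the lambda used in the feature definition. The numerator 6 comes from Out[3]. The denominator 23526 is not shown anywhere in the output; it is inferred here from the ratio, so treat it as an assumption.

```python
def chars_added_ratio_value(chars_added, chars):
    """Share of the page's characters that the edit added."""
    return chars_added / max(chars, 1)  # max() guards against an empty page


# 23526 is an inferred value for revision.chars, not one taken from the output.
typical = chars_added_ratio_value(6, 23526)
print(typical)

# With an empty page the divisor falls back to 1 instead of raising an error.
empty_page = chars_added_ratio_value(6, 0)
print(empty_page)  # 6.0
```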

There's an easier way to define this feature, though. I've overloaded the simple mathematical operators so that you can do math directly on features and get a new feature back. The code below roughly corresponds to the definition above.


In [5]:
chars_added_ratio = diff.chars_added / modifiers.max(revision.chars, 1) # Prevents divide by zero
list(extractor.extract(123456789, [chars_added_ratio]))


Out[5]:
[0.0002550369803621525]
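
For readers curious how operator overloading can produce a new feature, here is a minimal sketch in plain Python. It is not revscoring's implementation: ToyFeature, as_feature and toy_max are hypothetical names, and toy_max only plays a role loosely similar to modifiers.max.

```python
# Toy illustration only: none of these names exist in revscoring.

class ToyFeature:
    """A named value computed from a dict of raw inputs."""

    def __init__(self, name, compute):
        self.name = name
        self.compute = compute

    def __truediv__(self, other):
        # Dividing two features builds a third one that divides their values.
        other = as_feature(other)
        return ToyFeature(
            "({0} / {1})".format(self.name, other.name),
            lambda inputs: self.compute(inputs) / other.compute(inputs),
        )

    def __repr__(self):
        return "<{0}>".format(self.name)


def as_feature(value):
    """Wrap plain numbers so they can be combined with features."""
    if isinstance(value, ToyFeature):
        return value
    return ToyFeature(str(value), lambda inputs: value)


def toy_max(*args):
    """Feature-aware max() over features and plain numbers."""
    features = [as_feature(a) for a in args]
    name = "max({0})".format(", ".join(f.name for f in features))
    return ToyFeature(name, lambda inputs: max(f.compute(inputs) for f in features))


chars_added = ToyFeature("diff.chars_added", lambda inputs: inputs["chars_added"])
chars = ToyFeature("revision.chars", lambda inputs: inputs["chars"])

ratio = chars_added / toy_max(chars, 1)  # the 1 prevents divide by zero
print(ratio)  # <(diff.chars_added / max(revision.chars, 1))>
print(ratio.compute({"chars_added": 5, "chars": 20}))  # 0.25
print(ratio.compute({"chars_added": 5, "chars": 0}))  # 5.0
```

The composed feature carries a name built from its operands, which is what makes the expression readable when it is printed.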

Using datasources

There's also a set of datasources that are part of the dependency injection system. See revscoring/revscoring/datasources. I need to rename the diff datasource module when I import it because its name clashes with the diff features module imported earlier. You usually don't use features and datasources in the same context, which is why some of the names overlap.


In [6]:
from revscoring.datasources import diff as diff_datasource
list(extractor.extract(662953550, [diff_datasource.added_segments]))


Out[6]:
[['Ideology and policies',
  'Political scientists [[Robert Ford]] and [[Matthew Goodwin]] characterised UKIP as "a radical right party".{{sfn|Ford|Goodwin|2014|p=13}}\n\n',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}',
  '{{fact}}']]

OK. Let's define a new feature for counting the number of templates added. I'll make use of mwparserfromhell to do this. See the docs.


In [7]:
import mwparserfromhell as mwp

templates_added = Feature("diff.templates_added", 
                          lambda add_segments: sum(len(mwp.parse(s).filter_templates()) for s in add_segments), # Total templates across all added segments
                          depends_on=[diff_datasource.added_segments],
                          returns=int)
list(extractor.extract(662953550, [templates_added]))


Out[7]:
[11]
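
The result can be checked by hand against the segments shown in Out[6]. The sketch below does that without any dependencies, using str.count("{{") as a crude stand-in for mwparserfromhell's parser. That shortcut is an assumption that only holds for simple wikitext like this; it would miscount constructs such as triple-brace parameters, so it is not a replacement for the real parser.

```python
# The twelve added segments from Out[6], written out compactly.
added_segments = [
    'Ideology and policies',
    'Political scientists [[Robert Ford]] and [[Matthew Goodwin]] '
    'characterised UKIP as "a radical right party".'
    '{{sfn|Ford|Goodwin|2014|p=13}}\n\n',
] + ['{{fact}}'] * 10


def count_template_openings(segment):
    """Crude stand-in for a wikitext parser: count '{{' occurrences."""
    return segment.count("{{")


per_segment = [count_template_openings(s) for s in added_segments]
total = sum(per_segment)
print(per_segment)  # [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(total)  # 11
```

The heading segment contributes nothing, the sentence contributes its one sfn template, and each of the ten fact segments contributes one.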

Debugging

There are some facilities in place to help you make sense of issues when they arise. The most important is the draw function.


In [8]:
from revscoring.dependent import draw
draw(templates_added)


 - <diff.templates_added>
	 - <diff.added_segments>
		 - <diff.operations>
			 - <parent_revision.text>
			 - <revision.text>

In the tree structure above, you can see how our new feature depends on "diff.added_segments", which depends on "diff.operations", which depends (as you might imagine) on the text of the current and parent revisions. Other features are a bit more complicated.


In [9]:
draw(diff.added_badwords_ratio)


 - <((diff.badwords_added / max(diff.words_added, 1)) / max((parent_revision.badwords / max(parent_revision.words, 1)), 0.001))>
	 - <(diff.badwords_added / max(diff.words_added, 1))>
		 - <diff.badwords_added>
			 - <is_badword>
			 - <diff.added_words>
				 - <diff.added_segments>
					 - <diff.operations>
						 - <parent_revision.text>
						 - <revision.text>
		 - <max(diff.words_added, 1)>
			 - <diff.words_added>
				 - <diff.added_words>
					 - <diff.added_segments>
						 - <diff.operations>
							 - <parent_revision.text>
							 - <revision.text>
			 - <1>
	 - <max((parent_revision.badwords / max(parent_revision.words, 1)), 0.001)>
		 - <(parent_revision.badwords / max(parent_revision.words, 1))>
			 - <parent_revision.badwords>
				 - <is_badword>
				 - <parent_revision.words>
					 - <parent_revision.text>
			 - <max(parent_revision.words, 1)>
				 - <parent_revision.words>
					 - <parent_revision.words>
						 - <parent_revision.text>
				 - <1>
		 - <0.001>
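
The indented trees printed by draw can be imitated with a short recursive walk. The sketch below is a toy in plain Python, not the code behind revscoring.dependent.draw: Node and format_tree are hypothetical names, and the tree is built by hand to match the templates_added example.

```python
# Toy illustration only: Node and format_tree are not part of revscoring.

class Node:
    """A named item with a list of the items it depends on."""

    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = list(depends_on)


def format_tree(node, depth=0):
    """Return one line per node, indented one tab per level of depth."""
    lines = ["\t" * depth + " - <{0}>".format(node.name)]
    for dependency in node.depends_on:
        lines.extend(format_tree(dependency, depth + 1))
    return lines


# Hand-built copy of the dependency chain for diff.templates_added.
parent_text = Node("parent_revision.text")
text = Node("revision.text")
operations = Node("diff.operations", [parent_text, text])
added_segments = Node("diff.added_segments", [operations])
templates_added = Node("diff.templates_added", [added_segments])

tree = format_tree(templates_added)
print("\n".join(tree))
```

A shared dependency is printed once for every path that reaches it, which is why names such as revision.text can show up several times in a larger tree.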
